
    Can you tell a face from a HEVC bitstream?

    Image and video analytics are increasingly used at massive scale. Not only is the amount of data growing, but the complexity of data processing pipelines is also increasing, which compounds the computational burden. It is therefore becoming increasingly important to save computational resources wherever possible. We focus on one of the poster problems of visual analytics, face detection, and approach the issue of reducing computation by asking: is it possible to detect a face without full image reconstruction from the High Efficiency Video Coding (HEVC) bitstream? We demonstrate that this is indeed possible, with accuracy comparable to conventional face detection, by training a Convolutional Neural Network on the output of the HEVC entropy decoder.
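    A minimal sketch of the idea, not the paper's actual architecture: entropy-decoded syntax elements (e.g., prediction modes, partition depths, coefficient energies) are arranged into spatial feature maps and fed to a small CNN. The channel layout, input size, and network below are illustrative assumptions.

        # Hypothetical sketch: a small CNN over HEVC syntax-element feature maps.
        # The feature channels and architecture are assumptions, not the paper's.
        from tensorflow import keras
        from tensorflow.keras import layers

        def build_bitstream_face_detector(height=64, width=64, n_syntax_channels=4):
            # Input: per-region maps of entropy-decoded syntax elements laid
            # out spatially (one channel per syntax-element type).
            inputs = keras.Input(shape=(height, width, n_syntax_channels))
            x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
            x = layers.MaxPooling2D()(x)
            x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
            x = layers.MaxPooling2D()(x)
            x = layers.Flatten()(x)
            x = layers.Dense(128, activation="relu")(x)
            # Binary output: face present in this region or not.
            outputs = layers.Dense(1, activation="sigmoid")(x)
            return keras.Model(inputs, outputs)

        model = build_bitstream_face_detector()
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])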

    DFTS: Deep Feature Transmission Simulator

    Collaborative intelligence is a deployment paradigm for deep AI models in which some layers run on the mobile terminal or network edge, while others run in the cloud. In this scenario, features computed in the model need to be transferred between the edge and the cloud over an imperfect channel. Here we present a simulator to help study the effects of imperfect packet-based transmission of deep features. Our simulator is implemented in Keras and allows users to study the effects of both lossy packet transmission and quantization on model accuracy.
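    A minimal NumPy sketch of what such a simulator models, assuming uniform min-max quantization and independent per-packet losses filled with zeros; the packet layout, loss model, and function names are illustrative, not the DFTS API.

        import numpy as np

        def quantize(features, n_bits=8):
            # Uniform min-max quantization and dequantization of a feature tensor.
            lo, hi = features.min(), features.max()
            levels = 2 ** n_bits - 1
            q = np.round((features - lo) / (hi - lo + 1e-12) * levels)
            return q / levels * (hi - lo) + lo

        def transmit(features, packet_size=256, loss_prob=0.1, rng=None):
            # Split the flattened tensor into fixed-size packets; drop each packet
            # independently with probability loss_prob, filling losses with zeros.
            if rng is None:
                rng = np.random.default_rng()
            flat = features.ravel().copy()
            for start in range(0, flat.size, packet_size):
                if rng.random() < loss_prob:
                    flat[start:start + packet_size] = 0.0  # lost packet
            return flat.reshape(features.shape)

        # Example: corrupt a hypothetical split-layer output, then feed the
        # received tensor to the cloud-side half of the model.
        feats = np.random.randn(1, 14, 14, 256).astype(np.float32)
        received = transmit(quantize(feats, n_bits=6), loss_prob=0.05)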

    ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages

    Building multi-modal language models has been a trend in recent years, where additional modalities such as image, video, and speech are jointly learned along with natural language (i.e., textual information). Despite the success of multi-modal language models across these modalities, no existing solution jointly models neural network architectures and natural language. Providing neural architectural information as a new modality allows us to offer fast architecture-2-text and text-2-architecture retrieval/generation services on the cloud with a single inference. Such a solution is valuable for helping beginner and intermediate ML users arrive at better neural architectures or AutoML approaches with a simple text query. In this paper, we propose ArchBERT, a bi-modal model for joint learning and understanding of neural architectures and natural languages, which opens up new avenues for research in this area. We also introduce a pre-training strategy named Masked Architecture Modeling (MAM) for more generalized joint learning. Moreover, we introduce and publicly release two new bi-modal datasets for training and validating our methods. ArchBERT's performance is verified through a set of numerical experiments on different downstream tasks such as architecture-oriented reasoning, question answering, and captioning (summarization). Datasets, code, and demos are available as supplementary materials. Comment: CoNLL 2023
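    An illustrative sketch of the masked-modeling idea behind MAM, assuming an architecture can be flattened into a sequence of layer tokens; the tokenization, vocabulary, and masking rate are assumptions, not the paper's actual formulation.

        import random

        # Hypothetical layer-token vocabulary; a classifier head over this
        # vocabulary would predict the masked tokens during pre-training.
        LAYER_VOCAB = ["conv3x3", "conv1x1", "relu", "maxpool", "dense", "[MASK]"]

        def mask_architecture(tokens, mask_prob=0.15, rng=random):
            # Returns the corrupted sequence and per-position targets
            # (None where the token was left unmasked).
            corrupted, targets = [], []
            for tok in tokens:
                if rng.random() < mask_prob:
                    corrupted.append("[MASK]")
                    targets.append(tok)  # the model must recover this token
                else:
                    corrupted.append(tok)
                    targets.append(None)
            return corrupted, targets

        arch = ["conv3x3", "relu", "maxpool", "conv1x1", "relu", "dense"]
        corrupted, targets = mask_architecture(arch)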

    A Dataset of Labelled Objects on Raw Video Sequences

    We present an object-labelled dataset called SFU-HW-Objects-v1, which contains object labels for a set of raw video sequences. The dataset is useful for cases where both object detection accuracy and video coding efficiency need to be evaluated on the same data. Object ground truths have been labelled for 18 of the High Efficiency Video Coding (HEVC) v1 Common Test Conditions (CTC) sequences. The object categories used for labelling are based on the Common Objects in Context (COCO) labels. A total of 21 of the 80 original COCO object classes appear in the test sequences. Brief descriptions of the labelling process and the structure of the dataset are presented.
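    A hypothetical loader sketch, assuming one text file per frame with YOLO-style rows ("class_id cx cy w h", normalized to [0, 1]); the dataset's actual file layout may differ, so consult its documentation.

        from pathlib import Path

        def load_frame_labels(label_file, frame_width, frame_height):
            # Parse one per-frame label file into pixel-coordinate boxes.
            boxes = []
            for line in Path(label_file).read_text().splitlines():
                cls, cx, cy, w, h = line.split()
                cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
                # Convert normalized centre/size to pixel corner coordinates.
                x0 = (cx - w / 2) * frame_width
                y0 = (cy - h / 2) * frame_height
                boxes.append((int(cls), x0, y0, w * frame_width, h * frame_height))
            return boxes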

    Intra prediction with a 3-tap filter for lossy and lossless video coding

    Video coders are primarily designed for lossy compression. The basic steps in modern lossy video compression are block-based spatial or temporal prediction, transformation of the prediction error block, quantization of the transform coefficients, and entropy coding of the quantized coefficients together with other side information. In some cases, this lossy coding architecture may not be efficient for compression. For example, when lossless video compression is desired, the transform and quantization steps are skipped. Similarly, in lossy compression of synthetic video content (such as animations), the transform may be skipped for some blocks, and the prediction error in those blocks is quantized and entropy coded directly. In these cases, the block-based spatial prediction (called intra prediction) cannot sufficiently decorrelate the pixels by itself, and large prediction errors become more frequent. When the transform is skipped, block-based prediction can be replaced with a more accurate pixel-by-pixel prediction, since the original/reconstructed neighboring pixels inside the block are readily available in the absence of a transform. This thesis explores pixel-by-pixel prediction methods based on 3-tap filtering, which use three neighboring pixels for prediction according to a two-dimensional correlation model. Two of the proposed methods are designed for lossless intra coding, one with offline-determined prediction weights and the other with online-determined adaptive weights. The third proposed method applies the 3-tap filtering to transform-skipped blocks in lossy intra coding. The proposed methods are implemented within the HEVC reference software, and the experimental results indicate that pixel-by-pixel spatial prediction based on 3-tap filtering can improve compression efficiency for both lossless and lossy coding. M.S. - Master of Science
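    A sketch of the pixel-by-pixel 3-tap prediction idea for the lossless case, where reconstructed neighbours equal the original pixels. The weights below are placeholders, not the values derived in the thesis; one common separable correlation model uses horizontal and vertical correlations rho_h, rho_v with weights (rho_h, rho_v, -rho_h*rho_v).

        import numpy as np

        def predict_block_3tap(block, w_left=0.95, w_top=0.95, w_topleft=-0.9):
            # 'block' carries a one-pixel border of already-coded neighbours in
            # row 0 and column 0; each interior pixel is predicted from its
            # left, top, and top-left neighbours. In lossless coding the
            # reconstructed neighbours equal the originals, so the original
            # values can be used directly.
            blk = block.astype(np.float64)
            pred = np.zeros_like(blk)
            h, w = blk.shape
            for i in range(1, h):
                for j in range(1, w):
                    pred[i, j] = (w_left * blk[i, j - 1]
                                  + w_top * blk[i - 1, j]
                                  + w_topleft * blk[i - 1, j - 1])
            # The residual (prediction error) is what gets entropy coded.
            residual = blk[1:, 1:] - np.round(pred[1:, 1:])
            return pred[1:, 1:], residual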